fix(inference_provider): normalize provider endpoint errors by rodboev · Pull Request #1797 · NVIDIA-NeMo/Gym

rodboev · 2026-06-27T04:43:26Z

Summary

Normalize provider endpoint failures in responses_api_models/inference_provider so /v1/chat/completions and /v1/responses return structured provider-aware errors instead of falling through to the generic inner-server 500 path.

Closes #1748

Background

InferenceProvider.chat_completions() currently lets aiohttp.ClientResponseError escape directly from create_chat_completion(). Shared middleware then collapses that into the same generic JSON 500 string for auth failures, rate limits, missing models, and upstream 5xx responses. As more OpenAI-compatible providers land, that hides whether a failure is retryable and strips the provider context callers need to debug the endpoint.

Changes

catch ClientResponseError locally in responses_api_models/inference_provider/app.py and raise a structured HTTPException before shared middleware flattens the failure
normalize the surfaced payload to include provider_status, retryable, provider_context, model, category, and a concise provider-derived message
classify provider-neutral categories for authentication, request errors, model-not-found, rate limits, transient upstream failures, and fallback provider errors
add focused endpoint tests that cover structured auth failures, request errors, retry-exhausted 500 failures through the shared retry path, status-zero fallback, and the /v1/responses converted path
add helper coverage for retryable-status normalization and plain-text provider bodies without over-claiming route-level retry behavior
add direct helper coverage for top-level message and detail extraction, raw JSON fallback, string and None response_content, truncation, and the message-derived classification branches
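
The normalization described in the list above can be sketched roughly as follows. This is not the code in this PR: the helper names (extract_message, classify, normalize_provider_error), the category strings, the retryable status set, and the 200-character truncation limit are all assumptions made for illustration. Only the payload field names come from the description above. In the app, a payload like this would presumably become the detail of the HTTPException raised before shared middleware sees the failure.

```python
import json
from typing import Any, Dict

# Assumed values for illustration; the PR does not state them.
RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}
MAX_MESSAGE_CHARS = 200


def extract_message(response_content: Any) -> str:
    """Pull a concise message out of a provider body (bytes, str, or None)."""
    if response_content is None:
        return ""
    if isinstance(response_content, bytes):
        text = response_content.decode("utf-8", errors="replace")
    else:
        text = str(response_content)
    text = text.strip()
    try:
        body = json.loads(text)
    except ValueError:
        # Plain-text provider body: use it as-is, truncated.
        return text[:MAX_MESSAGE_CHARS]
    if isinstance(body, dict):
        # OpenAI-style nested error object first, then top-level keys.
        error = body.get("error")
        if isinstance(error, dict) and isinstance(error.get("message"), str):
            return error["message"][:MAX_MESSAGE_CHARS]
        for key in ("message", "detail"):
            if isinstance(body.get(key), str):
                return body[key][:MAX_MESSAGE_CHARS]
    # Raw JSON fallback when no known message field is present.
    return text[:MAX_MESSAGE_CHARS]


def classify(status: int, message: str) -> str:
    """Map a provider status and message to a provider-neutral category."""
    lowered = message.lower()
    if status in (401, 403):
        return "authentication"
    if status == 429:
        return "rate_limit"
    if status == 404 and "model" in lowered:
        return "model_not_found"
    if status in (408, 500, 502, 503, 504):
        return "transient_upstream"
    if 400 <= status < 500:
        return "request_error"
    # Covers status 0 and anything else unexpected.
    return "provider_error"


def normalize_provider_error(
    status: int, response_content: Any, model: str, provider_context: str
) -> Dict[str, Any]:
    """Build the structured payload surfaced to the caller."""
    message = extract_message(response_content)
    return {
        "provider_status": status,
        "retryable": status in RETRYABLE_STATUSES,
        "provider_context": provider_context,
        "model": model,
        "category": classify(status, message),
        "message": message or "provider returned no error body",
    }
```

For example, normalize_provider_error(401, '{"error": {"message": "Invalid API key"}}', "some-model", "chat_completions") yields a payload with category "authentication" and retryable False, while a status of 0 with a None body falls through to the "provider_error" category.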

Out of scope

no retry policy changes
no edits to nemo_gym/openai_utils.py or shared middleware
no docs, config, or dependency updates

Validation

Not run locally: uv run pytest responses_api_models/inference_provider/tests/test_app.py -x
Not run locally: uv run pre-commit run --files responses_api_models/inference_provider/app.py responses_api_models/inference_provider/tests/test_app.py
Passed locally: ruff check --config pyproject.toml responses_api_models/inference_provider/app.py responses_api_models/inference_provider/tests/test_app.py
Passed locally: ruff format --config pyproject.toml --check responses_api_models/inference_provider/app.py responses_api_models/inference_provider/tests/test_app.py

Notes

Native Windows uv run still fails in this repo because uvloop cannot build on Windows during dependency resolution. This is the same known environment limitation called out in the local repo config, so the focused pytest and pre-commit commands were not run here, and CI is the authoritative check for the behavioral tests added in this PR.

Signed-off-by: Rod Boev <rod.boev@gmail.com>

copy-pr-bot · 2026-06-27T04:43:30Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rodboev added 5 commits June 27, 2026 00:08

Preserve provider endpoint failure context

c3a47b3

Signed-off-by: Rod Boev <rod.boev@gmail.com>

Harden provider error classification coverage

2ba4772

Signed-off-by: Rod Boev <rod.boev@gmail.com>

Lock provider error fallback behavior

8269b5e

Signed-off-by: Rod Boev <rod.boev@gmail.com>

Cover empty provider error bodies

9773f01

Signed-off-by: Rod Boev <rod.boev@gmail.com>

Align retry-path proofs with actual client behavior

ba8cfe1

Signed-off-by: Rod Boev <rod.boev@gmail.com>

nemo-automation-bot added the community-request label (Issue reported or requested by someone from the community) on Jun 27, 2026

rodboev wants to merge 5 commits into NVIDIA-NeMo:main from rodboev:pr/inference-provider-error-normalization